INTERSPEECH.2009 - Speech Recognition

Total: 165

#1 Feature extraction for robust speech recognition using a power-law nonlinearity and power-bias subtraction

Authors: Chanwoo Kim ; Richard M. Stern

This paper presents a new feature extraction algorithm called Power-Normalized Cepstral Coefficients (PNCC) that is based on auditory processing. Major new features of PNCC processing include a power-law nonlinearity that replaces the traditional log nonlinearity used for MFCC coefficients, and a novel algorithm that suppresses background excitation by estimating the SNR from the ratio of arithmetic-mean to geometric-mean power and subtracting the inferred background power. Experimental results demonstrate that PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for various types of additive noise. The computational cost of PNCC is only slightly greater than that of conventional MFCC processing.
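
To make the two key ideas concrete, here is a minimal sketch in Python of (a) the power-law compression that replaces the MFCC log, and (b) the arithmetic-to-geometric mean power ratio used as a speech-presence cue. The exponent of 1/15 and all function names are assumptions for illustration; the actual power-bias subtraction logic in the paper is more involved.

```python
import numpy as np

def am_to_gm_ratio(power, eps=1e-12):
    """Arithmetic-to-geometric mean ratio of band powers: near 1 for
    flat (noise-like) power, larger for peaky (speech-like) power."""
    am = np.mean(power)
    gm = np.exp(np.mean(np.log(power + eps)))
    return am / gm

def power_law(mel_power, exponent=1.0 / 15.0):
    """Power-law compression in place of the MFCC log nonlinearity."""
    return mel_power ** exponent

# Toy frame of 40 mel filterbank power outputs.
frame = np.abs(np.random.randn(40)) ** 2
mfcc_style = np.log(frame + 1e-12)  # traditional log nonlinearity
pncc_style = power_law(frame)       # PNCC-style power-law nonlinearity
print("AM/GM ratio:", am_to_gm_ratio(frame))
```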

#2 Towards fusion of feature extraction and acoustic model training: a top down process for robust speech recognition

Authors: Yu-Hsiang Bosco Chiu ; Bhiksha Raj ; Richard M. Stern

This paper presents a strategy for learning physiologically-motivated components of a feature computation module discriminatively, directly from data, in a manner inspired by the presence of efferent processes in the human auditory system. In our model, a set of logistic functions representing the rate-level nonlinearities found in most mammalian hearing systems is incorporated into the feature extraction process. The parameters of these rate-level functions are estimated to maximize the a posteriori probability of the correct class in the training data. The estimated feature computation is observed to be robust against environmental noise. Experiments conducted with CMU Sphinx-III on the DARPA Resource Management task show that the discriminatively estimated rate-level nonlinearity yields better performance in the presence of background noise than traditional procedures, which separate feature extraction and model training into two distinct parts without feedback from the latter to the former.
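
As a rough illustration of the trainable rate-level stage described above, the sketch below applies a logistic compression to per-channel log-energies. The parameter names (theta, slope) and their values are hypothetical; in the paper these parameters would be estimated to maximize the posterior probability of the correct class rather than fixed by hand.

```python
import numpy as np

def rate_level(level_db, theta=20.0, slope=0.15):
    """Logistic rate-level nonlinearity: theta sets the midpoint level,
    slope the steepness; output saturates like an auditory-nerve rate."""
    return 1.0 / (1.0 + np.exp(-slope * (level_db - theta)))

channel_levels = np.linspace(-10.0, 60.0, 8)  # toy per-channel dB levels
print(rate_level(channel_levels))
```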

#3 Temporal modulation processing of speech signals for noise robust ASR

Authors: Hong You ; Abeer Alwan

In this paper, we analyze the temporal modulation characteristics of speech and noise from a speech/non-speech discrimination point of view. Although previous psychoacoustic studies [3][10] have shown that low temporal modulation components are important for speech intelligibility, there is no reported analysis of modulation components from the point of view of speech/noise discrimination. Our data-driven analysis of the modulation components of speech and noise reveals that speech and noise are more accurately classified using low-passed modulation frequencies than band-passed ones. The effects of additive noise on the modulation characteristics of speech signals are also analyzed. Based on this analysis, we propose a frequency-adaptive modulation processing algorithm for a noise-robust ASR task. The algorithm is based on speech channel classification and modulation pattern denoising. Speech recognition experiments are performed to compare the proposed algorithm with other noise-robust frontends, including RASTA and the ETSI AFE. Recognition results show that the frequency-adaptive modulation processing is promising.
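
The claim that low-passed modulation frequencies separate speech from noise better than band-passed ones suggests a simple low-pass operation on subband envelopes. A minimal sketch, assuming envelopes sampled at a 100 Hz frame rate and an 8 Hz cutoff (both illustrative values, not taken from the paper):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_modulation(envelopes, frame_rate=100.0, cutoff_hz=8.0, order=4):
    """Zero-phase low-pass filter along time for each subband envelope,
    keeping only the slow modulations. envelopes: (bands, frames)."""
    b, a = butter(order, cutoff_hz / (frame_rate / 2.0), btype="low")
    return filtfilt(b, a, envelopes, axis=1)

env = np.abs(np.random.randn(20, 300))  # toy subband envelopes
smoothed = lowpass_modulation(env)
```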

#4 Progressive memory-based parametric non-linear feature equalization

Authors: Luz Garcia ; Roberto Gemello ; Franco Mana ; Jose Carlos Segura

This paper analyzes the benefits and drawbacks of PEQ (Parametric Non-linear Equalization), a feature normalization technique based on parametric equalization of the MFCC parameters to match a reference probability distribution. Two limitations are outlined: the distortion intrinsic to the normalization process and the lack of accuracy in estimating normalization statistics on short sentences. Two evolutions of PEQ are presented as solutions to these limitations. The effects of the proposed evolutions are evaluated on three speech corpora, namely the WSJ0, AURORA-3 and HIWIRE cockpit databases, with different mismatch conditions given by convolutional and/or additive noise and non-native speakers. The results show that the encountered limitations can be overcome by the newly introduced techniques.
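
For intuition only, the sketch below performs a one-Gaussian simplification of distribution matching: each MFCC dimension is linearly mapped so its mean and variance match a reference distribution. The real PEQ transform is non-linear and parametric (handling, for example, the speech and silence parts of the distribution differently), so treat this as a crude stand-in, with all names invented here.

```python
import numpy as np

def gaussian_equalize(feats, ref_mean, ref_std, eps=1e-8):
    """Map each feature dimension to match a reference mean/std.
    feats: (frames, dims); a single-Gaussian simplification of PEQ."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + eps
    return ref_mean + ref_std * (feats - mu) / sigma

feats = 3.0 * np.random.randn(200, 13) + 1.0  # toy MFCC frames
normed = gaussian_equalize(feats, ref_mean=np.zeros(13), ref_std=np.ones(13))
```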

#5 Dynamic features in the linear domain for robust automatic speech recognition in a reverberant environment

Authors: Osamu Ichikawa ; Takashi Fukuda ; Ryuki Tachibana ; Masafumi Nishimura

Since MFCCs are calculated from logarithmic spectra, the delta and delta-delta act as difference operations in the logarithmic domain. In a reverberant environment, speech signals have trailing reverberations whose power follows a long-term exponential decay. This means the logarithmic delta value tends to remain large for a long time. This paper proposes a delta feature calculated in the linear domain, where the reverberant tail decays rapidly. In an experiment using an evaluation framework (CENSREC-4), significant improvements were found in reverberant situations by simply replacing the MFCC dynamic features with the proposed dynamic features.
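
The argument is easy to verify numerically: the log of an exponentially decaying tail is a linear ramp, so its delta sits at a constant non-zero value, while the delta of the linear power dies away with the tail itself. A small sketch (standard regression delta; wrap-around edge handling kept naive for brevity):

```python
import numpy as np

def delta(feats, width=2):
    """Standard regression delta over +/- width frames (edges wrap)."""
    num = sum(k * (np.roll(feats, -k, axis=0) - np.roll(feats, k, axis=0))
              for k in range(1, width + 1))
    den = 2 * sum(k * k for k in range(1, width + 1))
    return num / den

tail = np.exp(-np.arange(50) / 10.0)[:, None]  # reverberant power decay
log_d = delta(np.log(tail))  # roughly constant: stays large along the tail
lin_d = delta(tail)          # shrinks toward zero as the tail decays
```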

#6 Local projections and support vector based feature selection in speech recognition

Authors: Antonio Miguel ; Alfonso Ortega ; L. Buera ; Eduardo Lleida

In this paper we study a method for providing noise robustness under mismatch conditions in speech recognition using local frequency projections and feature selection. Local time-frequency filtering patterns have previously been used to provide noise-robust features and a simpler feature set for applying reliability weighting techniques. The proposed method combines two techniques to select the feature set: first, a reliability metric based on information theory, and second, a support vector set to reduce errors. The support vector set provides the most representative examples that influence the error rate under mismatch conditions, so that only the features which incorporate implicit robustness to mismatch are selected. Experimental results with this method are compared to baseline systems on the Aurora 2 database.

#7 Refactoring acoustic models using variational expectation-maximization

Authors: Pierre L. Dognin ; John R. Hershey ; Vaibhava Goel ; Peder A. Olsen

In probabilistic modeling, it is often useful to change the structure of, or refactor, a model, so that it has a different number of components, different parameter sharing, or other constraints. For example, we may wish to find a Gaussian mixture model (GMM) with fewer components that best approximates a reference model. Maximizing the likelihood of the refactored model under the reference model is equivalent to minimizing their KL divergence. For GMMs, this optimization is not analytically tractable. However, a lower bound to the likelihood can be maximized using a variational expectation-maximization algorithm. Automatic speech recognition provides a good framework to test the validity of such methods, because we can train reference models of any given size for comparison with refactored models. We show that we can efficiently reduce model size by 50%, with the same recognition performance as the corresponding model trained from data.
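
For reference, a quantity closely related to the bound used here is the variational approximation to the KL divergence between two GMMs due to Hershey and Olsen; a sketch for diagonal-covariance mixtures follows. Function names are mine, and the paper's variational EM additionally re-estimates the refactored model's parameters rather than just scoring a fixed pair of models.

```python
import numpy as np

def kl_diag(m1, v1, m2, v2):
    """Closed-form KL divergence N(m1,v1) || N(m2,v2), diagonal covs."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def variational_kl_gmm(wf, mf, vf, wg, mg, vg):
    """Variational approximation to KL(f || g) for diagonal GMMs f, g."""
    total = 0.0
    for a in range(len(wf)):
        num = sum(wf[b] * np.exp(-kl_diag(mf[a], vf[a], mf[b], vf[b]))
                  for b in range(len(wf)))
        den = sum(wg[b] * np.exp(-kl_diag(mf[a], vf[a], mg[b], vg[b]))
                  for b in range(len(wg)))
        total += wf[a] * np.log(num / den)
    return total

# Toy 1-D example: a 2-component reference f vs. a 1-component model g.
wf, mf, vf = np.array([0.5, 0.5]), np.array([[0.0], [4.0]]), np.ones((2, 1))
wg, mg, vg = np.array([1.0]), np.array([[2.0]]), np.ones((1, 1))
print(variational_kl_gmm(wf, mf, vf, wg, mg, vg))
```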

#8 Investigations on convex optimization using log-linear HMMs for digit string recognition

Authors: Georg Heigold ; David Rybach ; Ralf Schlüter ; Hermann Ney

Discriminative methods are an important technique to refine the acoustic model in speech recognition. Conventional discriminative training is initialized with some baseline model and the parameters are re-estimated in a separate step. This approach has proven successful, but it involves many heuristics, approximations, and parameters to be tuned. Such tuning requires much engineering and makes it difficult to reproduce and compare experiments. In contrast to conventional training, convex optimization techniques provide a sound approach to estimating all model parameters from scratch. Such a direct approach would hopefully dispense with additional heuristics, e.g. scaling of posteriors. This paper addresses the question of how well this concept, using log-linear models, carries over to practice. Experimental results are reported for a digit string recognition task, which allows for the investigation of this issue without approximations.

#9 Investigations on discriminative training in large scale acoustic model estimation

Author: Janne Pylkkönen

In this paper two common discriminative training criteria, maximum mutual information (MMI) and minimum phone error (MPE), are investigated. Two main issues are addressed: sensitivity to different lattice segmentations and the contribution of the parameter estimation method. It is noted that MMI and MPE may benefit from different lattice segmentation strategies. The use of discriminative criterion values as a measure of model goodness is shown to be problematic, as the recognition results do not correlate well with these measures. Moreover, the parameter estimation method clearly affects the recognition performance of the model irrespective of the value of the discriminative criterion. The dependence on the recognition task is also demonstrated using the two Finnish large-vocabulary dictation tasks in the experiments.

#10 Margin-space integration of MPE loss via differencing of MMI functionals for generalized error-weighted discriminative training

Authors: Erik McDermott ; Shinji Watanabe ; Atsushi Nakamura

Using the central observation that margin-based weighted classification error (modeled using Minimum Phone Error (MPE)) corresponds to the derivative with respect to the margin term of margin-based hinge loss (modeled using Maximum Mutual Information (MMI)), this article subsumes and extends margin-based MPE and MMI within a broader framework in which the objective function is an integral of MPE loss over a range of margin values. Applying the Fundamental Theorem of Calculus, this integral is easily evaluated using finite differences of MMI functionals; lattice-based training using the new criterion can then be carried out using differences of MMI gradients. Experimental results comparing the new framework with margin-based MMI, MCE and MPE on the Corpus of Spontaneous Japanese and the MIT OpenCourseWare/MIT-World corpus are presented.
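
In symbols (notation mine, condensed from the abstract): writing F_MMI(m) for the margin-based MMI functional at margin m and L_MPE(m) for the margin-based MPE loss, the central observation and the resulting integrated criterion are

```latex
\mathcal{L}_{\mathrm{MPE}}(m) \;=\; \frac{\partial}{\partial m}\,\mathcal{F}_{\mathrm{MMI}}(m)
\quad\Longrightarrow\quad
\int_{m_1}^{m_2} \mathcal{L}_{\mathrm{MPE}}(m)\,\mathrm{d}m
\;=\; \mathcal{F}_{\mathrm{MMI}}(m_2) \;-\; \mathcal{F}_{\mathrm{MMI}}(m_1).
```

Differentiating the right-hand side with respect to the model parameters shows why lattice-based training under the integrated criterion needs only the difference of MMI gradients evaluated at the two endpoint margins.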

#11 Compacting discriminative feature space transforms for embedded devices

Authors: Etienne Marcheret ; Jia-Yu Chen ; Petr Fousek ; Peder A. Olsen ; Vaibhava Goel

Discriminative training of the feature space using the minimum phone error objective function has been shown to yield remarkable accuracy improvements. These gains, however, come at a high cost in memory. In this paper we present techniques that maintain fMPE performance while reducing the required memory by approximately 94%. This is achieved by designing a quantization methodology that minimizes the error between the true fMPE computation and that produced with the quantized parameters. Also illustrated is a Viterbi search over the allocation of quantization levels, providing a framework for optimal non-uniform allocation of quantization levels across the dimensions of the fMPE feature vector. This provides an additional 8% relative reduction in required memory with no loss in recognition accuracy.
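
As a baseline for what the paper improves upon, here is a minimal uniform quantizer that packs transform parameters into one byte each with a per-row offset and step. This is only a generic building block: the paper instead minimizes error in the resulting fMPE computation and searches (via Viterbi) over non-uniform per-dimension level allocations, neither of which this toy shows.

```python
import numpy as np

def quantize_rows(mat, levels=256, eps=1e-12):
    """Per-row uniform quantization to byte codes plus (offset, step)."""
    lo = mat.min(axis=1, keepdims=True)
    step = (mat.max(axis=1, keepdims=True) - lo) / (levels - 1) + eps
    codes = np.round((mat - lo) / step).astype(np.uint8)
    return codes, lo, step

def dequantize(codes, lo, step):
    return lo + codes * step

M = np.random.randn(64, 1024).astype(np.float32)  # toy transform matrix
codes, lo, step = quantize_rows(M)
print("MSE:", float(np.mean((M - dequantize(codes, lo, step)) ** 2)))
```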

#12 A back-off discriminative acoustic model for automatic speech recognition

Authors: Hung-An Chang ; James R. Glass

In this paper we propose a back-off discriminative acoustic model for Automatic Speech Recognition (ASR). We use a set of broad phonetic classes to divide the classification problem originating from context-dependent modeling into a set of sub-problems. By appropriately combining the scores from classifiers designed for the sub-problems, we can guarantee that the back-off acoustic scores for different context-dependent units will be different. The back-off model can be combined with discriminative training algorithms to further improve performance. Experimental results on a large vocabulary lecture transcription task show that the proposed back-off discriminative acoustic model yields more than a 2.0% absolute word error rate reduction compared to a clustering-based acoustic model.

#13 Efficient generation and use of MLP features for Arabic speech recognition

Authors: J. Park ; F. Diehl ; M. J. F. Gales ; M. Tomalin ; P. C. Woodland

Front-end features computed using Multi-Layer Perceptrons (MLPs) have recently attracted much interest, but are a challenge to scale to large networks and very large training data sets. This paper discusses methods to reduce the training time for the generation of MLP features and their use in an ASR system using a variety of techniques: parallel training of a set of MLPs on different data sub-sets; methods for computing features from a combination of these networks; and rapid discriminative training of HMMs using MLP-based features. The impact of different training strategies on MLP frame-based accuracy is discussed, along with the effect on word error rates of incorporating the MLP features in various configurations into an Arabic broadcast audio transcription system.

#14 A study of bootstrapping with multiple acoustic features for improved automatic speech recognition

Authors: Xiaodong Cui ; Jian Xue ; Bing Xiang ; Bowen Zhou

This paper investigates a scheme of bootstrapping with multiple acoustic features (MFCC, PLP and LPCC) to improve the overall performance of automatic speech recognition. In this scheme, a Gaussian mixture distribution is estimated for each type of feature, resampled in each HMM state, by single-pass retraining on a shared decision tree. The acoustic models thus obtained from the multiple features are combined by likelihood averaging during decoding. Experiments on large-vocabulary spontaneous speech recognition show overall performance superior to the best of the acoustic models built from individual features. The scheme also achieves performance comparable to Recognizer Output Voting Error Reduction (ROVER), with computational advantages.
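
A sketch of the decoding-time combination, assuming "likelihood averaging" means an arithmetic mean of the per-stream state likelihoods, computed stably in the log domain; the weights and the example values are illustrative, not from the paper.

```python
import numpy as np

def average_likelihoods(log_liks, weights=None):
    """Arithmetic average of per-stream likelihoods for one HMM state,
    via log-sum-exp: log( sum_i w_i * exp(ll_i) )."""
    log_liks = np.asarray(log_liks, dtype=float)
    if weights is None:
        weights = np.full(log_liks.shape, 1.0 / len(log_liks))
    m = log_liks.max()
    return m + np.log(np.sum(weights * np.exp(log_liks - m)))

# One frame, one state: log-likelihoods from MFCC, PLP and LPCC models.
print(average_likelihoods([-12.3, -10.8, -11.5]))
```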

#15 Analysis of low-resource acoustic model self-training

Authors: Scott Novotney ; Richard Schwartz

Previous work on self-training of acoustic models using unlabeled data reported significant reductions in WER, assuming a large phonetic dictionary was available. We now assume that only the words from ten hours of speech are initially available. We are subsequently given a large vocabulary and quantify the value of repeating self-training with this larger dictionary. This experiment is used to analyze the effects of self-training on categories of words. We report the following findings: (i) Although the small 5k vocabulary raises WER by 2% absolute, self-training is equally effective as when using a large 75k vocabulary. (ii) Adding all 75k words to the decoding vocabulary after self-training reduces the WER degradation to only 0.8% absolute. (iii) By a wide margin, self-training most benefits those words that occur in the unlabeled audio but not in the initial transcripts.

#16 Log-linear model combination with word-dependent scaling factors

Authors: Björn Hoffmeister ; Ruoying Liang ; Ralf Schlüter ; Hermann Ney

Log-linear model combination is the standard approach in LVCSR to combine several knowledge sources, usually an acoustic and a language model. Instead of using a single scaling factor per knowledge source, we make the scaling factor word- and pronunciation-dependent. In this work, we combine three acoustic models, a pronunciation model, and a language model for a Mandarin BN/BC task. The achieved error rate reduction of 2% relative is small but consistent for two test sets. An analysis of the results shows that the major contribution comes from the improved interdependency of language and acoustic model.
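
In formulas (notation mine): conventional log-linear combination assigns one scale per knowledge source, while the variant studied here lets the scale depend on the hypothesized word (and pronunciation) w:

```latex
% Conventional: one scale \lambda_i per knowledge source i.
\hat{W} \;=\; \operatorname*{arg\,max}_{W}\; \sum_{i} \lambda_{i} \log p_{i}(W, X)
% Word-dependent: \lambda_{i,w} varies with each hypothesized word w.
\hat{W} \;=\; \operatorname*{arg\,max}_{W}\; \sum_{w \in W} \sum_{i} \lambda_{i,w} \log p_{i}(w, X)
```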

#17 Back-off language model compression

Authors: Boulos Harb ; Ciprian Chelba ; Jeffrey Dean ; Sanjay Ghemawat

With the availability of large amounts of training data relevant to speech recognition scenarios, scalability becomes a very productive way to improve language model performance. We present a technique that represents a back-off n-gram language model using arrays of integer values and thus renders it amenable to effective block compression. We propose a few such compression algorithms and evaluate the resulting language model along two dimensions: memory footprint, and speed reduction relative to the uncompressed model. We experimented with a model that uses a 32-bit word vocabulary (at most 4B words) and log-probabilities/back-off weights quantized to 1 byte each. The best compression algorithm achieves 2.6 bytes/n-gram at ~18X slower than uncompressed. For faster LM operation we found it feasible to represent the LM at ~4.0 bytes/n-gram and ~3X slower than the uncompressed LM. The memory footprint of an LM containing one billion n-grams can thus be reduced to 3.4 Gbytes without impacting its speed too much.
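
One standard building block for compressing sorted integer arrays of this kind is delta encoding followed by variable-length byte packing; a minimal sketch follows. This is illustrative only: the paper evaluates several block compression algorithms, and this is not claimed to be the exact scheme used.

```python
def varint(n):
    """Variable-length byte encoding of a non-negative integer."""
    out = bytearray()
    while True:
        lo7, n = n & 0x7F, n >> 7
        out.append(lo7 | (0x80 if n else 0x00))
        if not n:
            return bytes(out)

def compress_block(sorted_ids):
    """Delta-encode a sorted block of integer IDs, then varint-pack;
    small gaps between neighboring IDs cost one byte each."""
    out, prev = bytearray(), 0
    for x in sorted_ids:
        out += varint(x - prev)
        prev = x
    return bytes(out)

block = [3, 17, 18, 90, 4096, 4100]
print(len(compress_block(block)), "bytes for", len(block), "IDs")
```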

#18 Improving broadcast news transcription with a precision grammar and discriminative reranking

Authors: Tobias Kaufmann ; Thomas Ewender ; Beat Pfister

We propose a new approach to integrating a precision grammar into speech recognition, based on a novel robust parsing technique and discriminative reranking. By reranking the 100-best output of the LIMSI German broadcast news transcription system, we achieved a significant relative word error rate reduction of 9.6%. To our knowledge, this is the first significant improvement on a real-world broad-domain speech recognition task due to a precision grammar.

#19 Use of contexts in language model interpolation and adaptation

Authors: X. Liu ; M. J. F. Gales ; P. C. Woodland

Language models (LMs) are often constructed by building component models on multiple text sources and combining them using global, context-free interpolation weights. By re-adjusting these weights, LMs may be adapted to a target domain representing a particular genre, epoch or other higher-level attributes. A major limitation of this approach is that other factors determining the “usefulness” of sources on a context-dependent basis, such as modeling resolution, generalization, topics and styles, are poorly modeled. To overcome this problem, this paper investigates a context-dependent form of LM interpolation and test-time adaptation. Depending on the context, a discrete history weighting function is used to dynamically adjust the contribution from each component model. In previous research, this was used primarily for LM adaptation. In this paper, a range of schemes for combining context-dependent weights obtained from training and test data to improve LM adaptation is proposed. Consistent perplexity and error rate gains of 6% relative were obtained on a state-of-the-art broadcast recognition task.
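
The contrast with global interpolation can be written compactly (notation mine, with h the word history):

```latex
% Global, context-free weights:
P(w \mid h) \;=\; \sum_{i} \lambda_{i}\, P_{i}(w \mid h),
\qquad \sum_{i} \lambda_{i} = 1
% Context-dependent weighting function \lambda_i(h):
P(w \mid h) \;=\; \sum_{i} \lambda_{i}(h)\, P_{i}(w \mid h),
\qquad \sum_{i} \lambda_{i}(h) = 1 .
```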

#20 Exploiting Chinese character models to improve speech recognition performance

Authors: J. L. Hieronymus ; X. Liu ; M. J. F. Gales ; P. C. Woodland

The Chinese language is based on characters which are syllabic in nature. Since languages have syllabotactic rules governing the construction of syllables and their allowed sequences, Chinese character sequence models can be used as a first-level approximation of allowed syllable sequences. N-gram character sequence models were trained on 4.3 billion characters. Characters are used as a first-level recognition unit, with multiple pronunciations per character. For comparison, the CU-HTK Mandarin word-based system was used to recognize words, which were then converted to character sequences. The error rates of the character-only system for one-best recognition were slightly worse than those of word-based recognition scored on characters. However, combining the two systems using log-linear combination gives better results than either system separately. An equally weighted combination gave consistent CER gains of 0.1–0.2% absolute over the word-based standard system.

#21 Constraint selection for topic-based MDI adaptation of language models

Authors: Gwénolé Lecorvé ; Guillaume Gravier ; Pascale Sébillot

This paper presents an unsupervised topic-based language model adaptation method which specializes the standard minimum discrimination information (MDI) approach by identifying and combining topic-specific features. By acquiring a topic terminology from a thematically coherent corpus, language model adaptation is restricted to re-estimating the probabilities of only those n-grams ending with topic-specific words, keeping other probabilities untouched. Experiments are carried out on a large set of spoken documents about various topics. Results show significant perplexity and recognition improvements which outperform classical adaptation techniques.
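
A sketch of the underlying MDI machinery (notation mine): the adapted model P_A is the one closest in KL divergence to the background model P_B among those satisfying the chosen marginal constraints, and with unigram constraints restricted to the topic terminology it takes a scaling form that leaves all other n-grams untouched up to renormalization:

```latex
P_{A} \;=\; \operatorname*{arg\,min}_{P \,:\, \text{constraints}} \; \mathrm{KL}(P \,\|\, P_{B}),
\qquad
P_{A}(w \mid h) \;=\; \frac{\alpha(w)\, P_{B}(w \mid h)}{Z(h)},
\quad
\alpha(w) =
\begin{cases}
P_{\mathrm{topic}}(w) / P_{B}(w) & \text{if } w \text{ is a topic word},\\[2pt]
1 & \text{otherwise}.
\end{cases}
```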

#22 Nonstationary latent Dirichlet allocation for speech recognition

Authors: Chuang-Hua Chueh ; Jen-Tzung Chien

Latent Dirichlet allocation (LDA) has been successful for document modeling. LDA extracts the latent topics across documents, and words in a document are generated from the same topic distribution. However, in real-world documents, the usage of words varies across paragraphs and is accompanied by different writing styles. This study extends LDA to cope with the variation of topic information within a document. We build the nonstationary LDA (NLDA) by incorporating a Markov chain which detects the stylistic segments in a document. Each segment corresponds to a particular style of composition within the document. NLDA can thus exploit topic information between documents as well as word variations within a document. We accordingly establish a Viterbi-based variational Bayesian procedure. A language model adaptation scheme using NLDA is developed for speech recognition. Experimental results show improvements of NLDA over LDA in terms of perplexity and word error rate.

#23 Multiple text segmentation for statistical language modeling

Authors: Sopheap Seng ; Laurent Besacier ; Brigitte Bigi ; Eric Castelli

In this article we address the text segmentation problem in statistical language modeling for under-resourced languages whose writing systems lack word boundary delimiters. While the lack of text resources has a negative impact on the performance of language models, the errors introduced by automatic word segmentation make those data even less usable. To better exploit the text resources, we propose a method based on weighted finite-state transducers to estimate the N-gram language model from a training corpus in which each sentence is segmented in multiple ways instead of a single segmentation. The multiple segmentation generates more N-grams from the training corpus and yields N-grams not found under a unique segmentation. We use this approach to train language models for automatic speech recognition systems for the Khmer and Vietnamese languages, and the multiple segmentation leads to better performance than the unique-segmentation approach.
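
To see why multiple segmentation yields extra N-grams, consider a toy enumerator that splits an unsegmented string against a small lexicon and pools bigram counts over all parses. The real system weights parses with a weighted finite-state transducer; here every parse counts equally, and the lexicon is invented for illustration.

```python
from collections import Counter

def segmentations(text, lexicon, prefix=()):
    """Yield every split of `text` into lexicon words."""
    if not text:
        yield prefix
    for i in range(1, len(text) + 1):
        if text[:i] in lexicon:
            yield from segmentations(text[i:], lexicon, prefix + (text[:i],))

def pooled_bigrams(text, lexicon):
    """Bigram counts accumulated over all segmentations of `text`."""
    counts = Counter()
    for seg in segmentations(text, lexicon):
        counts.update(zip(seg, seg[1:]))
    return counts

lex = {"a", "b", "c", "ab", "abc"}
print(pooled_bigrams("abc", lex))  # includes ('ab','c'), unseen in "a b c"
```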

#24 Measuring tagging performance of a joint language model

Authors: Denis Filimonov ; Mary Harper

Predicting syntactic information in a joint language model (LM) has been shown not only to improve the model at its main task of predicting words, but also to allow this information to be passed to other applications, such as spoken language processing. This raises the question of just how accurate the syntactic information predicted by the LM is. In this paper, we present a joint LM designed not only to scale to large quantities of training data, but also to utilize fine-grained syntactic information as well as other features, such as morphology and prosody. We evaluate the accuracy of our model at predicting syntactic information on the POS tagging task against state-of-the-art POS taggers, and on perplexity against the n-gram model.

#25 Improved language modelling using bag of word pairs

Authors: Langzhou Chen ; K. K. Chin ; Kate Knill

The bag-of-words (BoW) method has been used widely in language modelling and information retrieval. A document is expressed as a group of words disregarding the grammar and the order of word information. A typical BoW method is latent semantic analysis (LSA), which maps the words and documents onto the vectors in LSA space. In this paper, the concept of BoW is extended to Bag-of-Word Pairs (BoWP), which expresses the document as a group of word pairs. Using word pairs as a unit, the system can capture more complex semantic information than BoW. Under the LSA framework, the BoWP system is shown to improve both perplexity and word error rate (WER) compared to a BoW system.